Abstract


Our goal is to deploy a high-accuracy system starting with zero training examples.
We consider an on-the-job setting, where as inputs arrive, we use real-time
crowdsourcing to resolve uncertainty where needed and output our prediction when
confident. As the model improves over time, the reliance on crowdsourcing queries
decreases. We cast our setting as a stochastic game based on Bayesian decision
theory, which allows us to balance latency, cost, and accuracy objectives in a
principled way. Computing the optimal policy is intractable, so we develop an
approximation based on Monte Carlo Tree Search. We tested our approach on three
datasets: named-entity recognition, sentiment classification, and image
classification. On the NER task we obtained more than an order of magnitude reduction
in cost compared to full human annotation, while boosting performance relative to
the expert-provided labels. We also achieve an 8% F1 improvement over having a
single human label the whole set, and a 28% F1 improvement over online learning.


“Poor is the pupil who does not surpass his master.” 

- Leonardo da Vinci 


1 Introduction 

There are two roads to an accurate AI system today: (i) gather a huge amount of labeled training
data [?] and do supervised learning [?]; or (ii) use crowdsourcing to directly perform the task [?].
However, both solutions require non-trivial amounts of time and money. In many situations, one
wishes to build a new system (e.g., to do Twitter information extraction [?] to aid in disaster relief
efforts or monitor public opinion) but one simply lacks the resources to follow either the pure ML
or pure crowdsourcing road.

In this paper, we propose a framework called on-the-job learning (formalizing and extending ideas
first implemented in [6]), in which we produce high-quality results from the start without requiring
a trained model. When a new input arrives, the system can choose to asynchronously query the
crowd on parts of the input it is uncertain about (e.g. query about the label of a single token in a
sentence). After collecting enough evidence, the system makes a prediction. The goal is to maintain
high accuracy by initially using the crowd as a crutch, but gradually becoming more self-sufficient
as the model improves. Online learning [?] and online active learning [8, 9, 10] are different in that
they do not actively seek new information prior to making a prediction, and cannot maintain high
accuracy independent of the number of data instances seen so far. Active classification [?], like us,
strategically seeks information (by querying a subset of labels) prior to prediction, but it is based on
a static policy, whereas we improve the model during test time based on observed data.

Figure 1: Named entity recognition on tweets in on-the-job learning.

To determine which queries to make, we model on-the-job learning as a stochastic game based on
a CRF prediction model. We use Bayesian decision theory to trade off latency, cost, and accuracy
in a principled manner. Our framework naturally gives rise to intuitive strategies: To achieve high
accuracy, we should ask for redundant labels to offset the noisy responses. To achieve low latency,
we should issue queries in parallel, whereas if latency is unimportant, we should issue queries
sequentially in order to be more adaptive. Computing the optimal policy is intractable, so we develop
an approximation based on Monte Carlo tree search [?] and progressive widening to reason about
continuous time [?].

We implemented and evaluated our system on three different tasks: named-entity recognition,
sentiment classification, and image classification. On the NER task we obtained more than an order of
magnitude reduction in cost compared to full human annotation, while boosting performance
relative to the expert-provided labels. We also achieve an 8% F1 improvement over having a single human
label the whole set, and a 28% F1 improvement over online learning. An open-source
implementation of our system, dubbed LENSE for “Learning from Expensive Noisy Slow Experts”, is available
at http://www.github.com/keenon/lense


2 Problem formulation 


Consider a structured prediction problem from input $x = (x_1, \ldots, x_n)$ to output $y = (y_1, \ldots, y_n)$.
For example, for named-entity recognition (NER) on tweets, $x$ is a sequence of words in the tweet
(e.g., “on George str.”) and $y$ is the corresponding sequence of labels (e.g., NONE LOCATION
LOCATION). The full set of labels consists of PERSON, LOCATION, RESOURCE, and NONE.


In the on-the-job learning setting, inputs arrive in a stream. On each input $x$, we make zero or more
queries $q_1, q_2, \ldots$ on the crowd to obtain labels (potentially more than once) for any positions in
$x$. The responses $r_1, r_2, \ldots$ come back asynchronously and are incorporated into our current
prediction model $p_\theta$. Figure 2 (left) shows one possible outcome: we query positions $q_1 = 2$
(“George”) and $q_2 = 3$ (“str.”). The first query returns $r_1 = \text{LOCATION}$, upon which we make
another query on the same position $q_3 = 2$ (“George”), and so on. When we have sufficient
confidence about the entire output, we return the most likely prediction $\hat{y}$ under the model. Each
query $q_i$ is issued at time $s_i$ and the response comes back at time $t_i$. Assume that each query costs
$m$ cents. Our goal is to choose queries to maximize accuracy while minimizing latency and cost.


We make several remarks about this setting: First, we must make a prediction $\hat{y}$ on each input $x$ in
the stream, unlike in active learning, where we are only interested in the pool or stream of examples
for the purposes of building a good model. Second, we evaluate on $\text{accuracy}(y, \hat{y})$ against the
true label sequence $y$ (on named-entity recognition, this is the F1 metric), but $y$ is never actually
observed: the only feedback is via the responses, like in partial monitoring games [?]. Therefore,
we must make enough queries to garner sufficient confidence (something we can’t do in partial
monitoring games) on each example from the beginning. Finally, the responses are used to update
the prediction model, like in online learning. This allows the number of queries needed (and thus
cost and latency) to decrease over time without compromising accuracy.

Figure 2: Example behavior while running structured prediction on the tweet “Soup on George str.”
We omit the RESOURCE label from the game tree for visual clarity.

(a) Incorporating information from responses. The bar graphs represent the marginals over the
labels for each token (indicated by the first character) at different points in time. The two
timelines show how the system updates its confidence over labels based on the crowd’s responses.
The system continues to issue queries until it has sufficient confidence on its labels. See the
paragraph on behavior in Section 3 for more information.

(b) Game tree. An example of a partial game tree constructed by the system when deciding which
action to take in the state $\sigma = (1, (3), (0), (\varnothing), (\varnothing))$, i.e. the query $q_1 = 3$ has already been
issued and the system must decide whether to issue another query or wait for a response to $q_1$.

3 Model 

We model on-the-job learning as a stochastic game with two players: the system and the crowd.
The game starts with the system receiving input $x$ and ends when the system turns in a set of labels
$y = (y_1, \ldots, y_n)$. During the system’s turn, the system may choose a query action $q \in \{1, \ldots, n\}$
to ask the crowd to label $y_q$. The system may also choose the wait action ($q = \varnothing_W$) to wait for the
crowd to respond to a pending query or the return action ($q = \varnothing_R$) to terminate the game and return
its prediction given responses received thus far. The system can make as many queries in a row (i.e.
simultaneously) as it wants, before deciding to wait or turn in.^1 When the wait action is chosen,
the turn switches to the crowd, which provides a response $r$ to one pending query, and advances
the game clock by the time taken for the crowd to respond. The turn then immediately reverts back
to the system. When the game ends (the system chooses the return action), the system evaluates a
utility that depends on the accuracy of its prediction, the number of queries issued and the total time
taken. The system should choose query and wait actions to maximize the utility of the prediction
eventually returned.

In the rest of this section, we describe the details of the game tree and our choice of utility, specify
models for crowd responses, and briefly explore the behavior admitted by our model.


Game tree. Let us now formalize the game tree in terms of its states, actions, transitions and
rewards; see Figure 2b for an example. The game state $\sigma = (t_{\text{now}}, q, s, r, t)$ consists of the current
time $t_{\text{now}}$, the actions $q = (q_1, \ldots, q_{k-1})$ that have been issued at times $s = (s_1, \ldots, s_{k-1})$ and the
responses $r = (r_1, \ldots, r_{k-1})$ that have been received at times $t = (t_1, \ldots, t_{k-1})$. Let $r_j = \varnothing$ and
$t_j = \varnothing$ iff $q_j$ is not a query action or its responses have not been received by time $t_{\text{now}}$.


During the system’s turn, when the system chooses an action $q_k$, the state is updated to
$\sigma' = (t_{\text{now}}, q', s', r', t')$, where $q' = (q_1, \ldots, q_k)$, $s' = (s_1, \ldots, s_{k-1}, t_{\text{now}})$,
$r' = (r_1, \ldots, r_{k-1}, \varnothing)$ and $t' = (t_1, \ldots, t_{k-1}, \varnothing)$. If $q_k \in \{1, \ldots, n\}$, then the system chooses
another action from the new state $\sigma'$. If $q_k = \varnothing_W$, the crowd makes a stochastic move from $\sigma'$.
Finally, if $q_k = \varnothing_R$, the game ends, and the system returns its best estimate of the labels using the
responses it has received and obtains a utility $U(\sigma)$ (defined later).

^1 This rules out the possibility of launching a query midway through waiting for the next response. However,
we feel this is a reasonable limitation that significantly simplifies the search space.
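The state update on the system's turn can be sketched as a small immutable data structure. This is a minimal illustration rather than the LENSE implementation: the names `GameState`, `system_move`, `in_flight`, `WAIT` and `RETURN` are hypothetical, and `None` stands in for the empty response and empty response time.

```python
from dataclasses import dataclass, replace
from typing import Tuple

WAIT = "wait"      # hypothetical stand-in for the wait action
RETURN = "return"  # hypothetical stand-in for the return action


@dataclass(frozen=True)
class GameState:
    t_now: float
    q: Tuple = ()  # actions issued so far
    s: Tuple = ()  # times at which the actions were issued
    r: Tuple = ()  # responses (None while not received)
    t: Tuple = ()  # response times (None while not received)


def system_move(state: GameState, action) -> GameState:
    """Append the chosen action at the current time with empty response slots."""
    return replace(
        state,
        q=state.q + (action,),
        s=state.s + (state.t_now,),
        r=state.r + (None,),
        t=state.t + (None,),
    )


def in_flight(state: GameState):
    """Indices of query actions whose response has not arrived yet."""
    return [
        j
        for j, (q_j, r_j) in enumerate(zip(state.q, state.r))
        if q_j not in (WAIT, RETURN) and r_j is None
    ]
```

For example, starting from the state in Figure 2b (one query on position 3 issued at time 0, clock at 1), calling `system_move(state, WAIT)` appends the wait action with issue time 1 and leaves the original query in flight.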

Let $F = \{1 \le j \le k - 1 \mid q_j \ne \varnothing_W \wedge r_j = \varnothing\}$ be the set of in-flight requests. During the crowd’s
turn (i.e. after the system chooses $\varnothing_W$), the next response from the crowd, $j^* \in F$, is chosen:
$j^* = \arg\min_{j \in F} t'_j$, where $t'_j$ is sampled from the response-time model,
$t'_j \sim p_T(t'_j \mid s_j, t'_j > t_{\text{now}})$, for each $j \in F$. Finally, a response is sampled using a response model,
$r'_{j^*} \sim p(r'_{j^*} \mid x, r, q)$, and the state is updated to $\sigma' = (t'_{j^*}, q, s, r', t')$, where
$r' = (r_1, \ldots, r'_{j^*}, \ldots, r_k)$ and $t' = (t_1, \ldots, t'_{j^*}, \ldots, t_k)$.
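The crowd's move can be sketched as follows: draw an arrival time for every in-flight query, conditioned on it lying after the current time, and let the earliest one respond. This is a sketch under stated assumptions: the function name `crowd_move` and its arguments are hypothetical, the conditioning is done by simple rejection sampling, and the exponential delay used in the usage note is an illustrative choice, not the paper's response-time model.

```python
import random


def crowd_move(t_now, issue_times, sample_delay, sample_response, rng):
    """Simulate one crowd turn.

    issue_times maps the index j of each in-flight query to its issue time s_j.
    Returns (j_star, arrival_time, response) for the earliest response.
    """
    arrivals = {}
    for j, s_j in issue_times.items():
        # Rejection-sample an arrival time that lies after the current time.
        # Assumes the delay model puts positive probability beyond t_now.
        while True:
            t_j = s_j + sample_delay(rng)
            if t_j > t_now:
                break
        arrivals[j] = t_j
    j_star = min(arrivals, key=arrivals.get)
    return j_star, arrivals[j_star], sample_response(j_star, rng)
```

A call such as `crowd_move(1.0, {0: 0.0, 1: 0.5}, lambda g: g.expovariate(1.0), lambda j, g: "LOCATION", random.Random(0))` returns the index of the query that answers first, its arrival time (always later than 1.0), and the sampled response.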

Utility. Under Bayesian decision theory, the optimal choice for an action in state
$\sigma = (t_{\text{now}}, q, s, r, t)$ is the one that attains the maximum expected utility (i.e. value) for the game starting
at $\sigma$. Recall that the system can return at any time, at which point it receives a utility that trades
off two things: The first is the accuracy of the MAP estimate according to the model’s best guess
of $y$ incorporating all responses received by that time. The second is the cost of making queries: a
(monetary) cost $w_M$ per query made and a penalty of $w_T$ per unit of time taken. Formally, we define
the utility to be:

$$U(\sigma) = \text{ExpAcc}(p(y \mid x, q, s, r, t)) - (n_Q w_M + t_{\text{now}} w_T), \tag{1}$$

$$\text{ExpAcc}(p) = \mathbb{E}_{p(y)}[\text{Accuracy}(\arg\max_{y'} p(y'))], \tag{2}$$

where $n_Q = |\{j \mid q_j \in \{1, \ldots, n\}\}|$ is the number of queries made and $p(y \mid x, q, s, r, t)$ is a prediction
model that incorporates the crowd’s responses.
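As a worked illustration of Equations (1) and (2), the sketch below evaluates the utility from per-token marginals. It makes two simplifying assumptions that are not in the text: accuracy is measured per token (rather than F1), and the prediction is the per-token argmax, so the expected accuracy is the average of the largest marginal at each position. The function names are hypothetical.

```python
def expected_accuracy(marginals):
    """Expected per-token accuracy of the argmax prediction.

    marginals is a list with one dict per token, mapping label -> probability.
    """
    return sum(max(m.values()) for m in marginals) / len(marginals)


def utility(marginals, n_queries, t_now, w_m, w_t):
    """Expected accuracy minus the query cost and the time penalty."""
    return expected_accuracy(marginals) - (n_queries * w_m + t_now * w_t)


marginals = [
    {"LOCATION": 0.9, "NONE": 0.1},
    {"LOCATION": 0.6, "NONE": 0.4},
]
u = utility(marginals, n_queries=2, t_now=1.5, w_m=0.01, w_t=0.1)
```

Here the expected accuracy is (0.9 + 0.6) / 2 = 0.75 and the penalty is 2 x 0.01 + 1.5 x 0.1 = 0.17, so the utility is 0.58. Raising `w_t` makes waiting for further responses less attractive, while raising `w_m` discourages redundant queries.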

The utility of wait and return actions is computed by taking expectations over subsequent trajectories
in the game tree. This is intractable to compute exactly, so we propose an approximate algorithm in
Section 4.


Environment model. The final component is a model of the environment (crowd). Given input
$x$ and queries $q = (q_1, \ldots, q_k)$ issued at times $s = (s_1, \ldots, s_k)$, we define a distribution over the
output $y$, responses $r = (r_1, \ldots, r_k)$ and response times $t = (t_1, \ldots, t_k)$ as follows:

$$p(y, r, t \mid x, q, s) = p_\theta(y \mid x) \prod_{i=1}^{k} p_R(r_i \mid y_{q_i}) \, p_T(t_i \mid s_i). \tag{3}$$

The three components are as follows: $p_\theta(y \mid x)$ is the prediction model (e.g. a standard linear-chain
CRF); $p_R(r \mid y_q)$ is the response model, which describes the distribution of the crowd’s response
$r$ for a given query $q$ when the true answer is $y_q$; and $p_T(t_i \mid s_i)$ specifies the latency of query
$q_i$. The CRF model $p_\theta(y \mid x)$ is learned based on all actual responses (not simulated ones) using
AdaGrad. To model annotation errors, we set $p_R(r \mid y_q) = 0.7$ iff $r = y_q$,^2 and distribute the
remaining probability for $r$ uniformly. Given this full model, we can compute $p(r' \mid x, r, q)$ simply
by marginalizing out $y$ and $t$ from Equation (3). When conditioning on $r$, we ignore responses that
have not yet been received (i.e. when $r_j = \varnothing$ for some $j$).
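To make the response model concrete, the sketch below applies Bayes' rule to the label of a single position given a list of crowd responses, using the 0.7 accuracy and uniform error distribution described above. It is an illustration under a simplifying assumption: the position is treated in isolation with a given prior, whereas the full model couples positions through the CRF. The function names are hypothetical.

```python
def response_likelihood(r, y, labels, acc=0.7):
    """Probability of response r when the true label is y."""
    if r == y:
        return acc
    # The remaining probability mass is spread uniformly over the wrong labels.
    return (1.0 - acc) / (len(labels) - 1)


def posterior(prior, responses, acc=0.7):
    """Posterior over labels for one position given independent responses."""
    labels = list(prior)
    scores = {}
    for y in labels:
        p = prior[y]
        for r in responses:
            p *= response_likelihood(r, y, labels, acc)
        scores[y] = p
    z = sum(scores.values())
    return {y: p / z for y, p in scores.items()}


prior = {"PERSON": 0.25, "LOCATION": 0.25, "RESOURCE": 0.25, "NONE": 0.25}
one_vote = posterior(prior, ["LOCATION"])
two_votes = posterior(prior, ["LOCATION", "LOCATION"])
split_vote = posterior(prior, ["LOCATION", "PERSON"])
```

With a uniform prior, one LOCATION response gives a posterior of 0.7 for LOCATION, two agreeing responses give 0.49 / 0.52 (about 0.94), and a disagreeing pair leaves LOCATION at 0.07 / 0.16 (about 0.44). This is why redundant queries help offset noisy responses.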


Behavior. Let’s look at typical behavior that we expect the model and utility to capture. Figure 2a
shows how the marginals over the labels change as the crowd provides responses for our running
example, i.e. named entity recognition for the sentence “Soup on George str.”. In both timelines,
the system issues queries on “Soup” and “George” because it is not confident about its predictions 
for these tokens. In the first timeline, the crowd correctly responds that “Soup” is a resource and 
that “George” is a location. Integrating these responses, the system is also more confident about 
its prediction on “str.”, and turns in the correct sequence of labels. In the second timeline, a crowd 
worker makes an error and labels “George” to be a person. The system still has uncertainty on 
“George” and issues an additional query which receives a correct response, following which the 
system turns in the correct sequence of labels. While the answer is still correct, the system could 
have taken less time to respond by making an additional query on “George” at the very beginning. 


^2 We found the humans we hired were roughly 70% accurate in our experiments.

4 Game playing


In Section 3 we modeled on-the-job learning as a stochastic game played between the system and
the crowd. We now turn to the problem of actually finding a policy that maximizes the expected
utility, which is, of course, intractable because of the large state space.


Our algorithm (Algorithm 1) combines ideas from Monte Carlo tree search [?] to systematically
explore the state space and progressive widening [?] to deal with the challenge of continuous
variables (time). Some intuition about the algorithm is provided below. When simulating the system’s
turn, the next state (and hence action) is chosen using the upper confidence tree (UCT) decision
rule that trades off maximizing the value of the next state (exploitation) with the number of visits
(exploration). The crowd’s turn is simulated based on transitions defined in Section 3. To handle the
unbounded fanout during the crowd’s turn, we use progressive widening that maintains a current set
of “active” or “explored” states, which is gradually grown with time. Let $N(\sigma)$ be the number of
times a state has been visited, and $C(\sigma)$ be all successor states that the algorithm has sampled.


Algorithm 1 Approximating expected utility with MCTS and progressive widening

For all σ, N(σ) ← 0, V(σ) ← 0, C(σ) ← []        ▷ Initialize visits, utility sum, and children
function MONTECARLOVALUE(state σ)
    increment N(σ)
    if system’s turn then
        σ' ← successor state of σ that maximizes the UCT score        ▷ Choose next state σ' using UCT
        v ← MONTECARLOVALUE(σ')
        V(σ) ← V(σ) + v        ▷ Record observed utility
        return v
    else if crowd’s turn then
        if max(1, √N(σ)) ≤ |C(σ)| then        ▷ Restrict continuous samples using PW
            σ' is sampled from the set of already visited C(σ) based on (3)
        else
            σ' is drawn based on (3)
            C(σ) ← C(σ) ∪ {σ'}
        end if
        return MONTECARLOVALUE(σ')
    else if game terminated then
        return utility U of σ according to (1)
    end if
end function
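The following is a small runnable sketch in the spirit of Algorithm 1, not the LENSE implementation. Several pieces are assumptions made for illustration: `ToyGame` is a hypothetical one-label game with made-up costs and an exponential delay, the UCT score is the standard mean value plus `c * sqrt(log N(parent) / N(child))` with an arbitrary constant `c`, revisited crowd outcomes are chosen uniformly from the stored samples, and the observed utility is recorded at every visited state so that the average is defined for all successors.

```python
import math
import random
from collections import defaultdict


class ToyGame:
    """Hypothetical game on one binary label; votes arrive for or against it."""
    W_M, W_T, MAX_Q = 0.02, 0.01, 3  # made-up query cost, time cost, query cap

    def root(self):
        # State: (turn, pending queries, votes for, votes against, queries, time)
        return ("system", 0, 0, 0, 0, 0.0)

    def turn(self, s):
        return s[0]

    def successors(self, s):
        _, pend, a, b, nq, t = s
        out = [("terminal", pend, a, b, nq, t)]                # return action
        if pend > 0:
            out.append(("crowd", pend, a, b, nq, t))           # wait action
        if nq < self.MAX_Q:
            out.append(("system", pend + 1, a, b, nq + 1, t))  # query action
        return out

    def sample_crowd(self, s, rng):
        _, pend, a, b, nq, t = s
        t += rng.expovariate(1.0)  # assumed delay model (continuous time)
        if rng.random() < 0.7:
            a += 1
        else:
            b += 1
        return ("system", pend - 1, a, b, nq, t)

    def utility(self, s):
        _, _, a, b, nq, t = s
        like_a, like_b = 0.7 ** a * 0.3 ** b, 0.3 ** a * 0.7 ** b
        p_a = like_a / (like_a + like_b)
        return max(p_a, 1 - p_a) - (nq * self.W_M + t * self.W_T)


class Search:
    def __init__(self, game, c=0.5, seed=0):
        self.game, self.c, self.rng = game, c, random.Random(seed)
        self.N = defaultdict(int)    # visit counts
        self.V = defaultdict(float)  # sums of observed utilities
        self.C = defaultdict(list)   # sampled crowd successors

    def uct(self, parent, child):
        if self.N[child] == 0:
            return math.inf  # try every successor at least once
        mean = self.V[child] / self.N[child]
        return mean + self.c * math.sqrt(math.log(self.N[parent]) / self.N[child])

    def simulate(self, s):
        self.N[s] += 1
        kind = self.game.turn(s)
        if kind == "terminal":
            v = self.game.utility(s)
        elif kind == "system":
            nxt = max(self.game.successors(s), key=lambda ch: self.uct(s, ch))
            v = self.simulate(nxt)
        else:
            # Progressive widening: only add a new sample while |C| < sqrt(N).
            if max(1.0, math.sqrt(self.N[s])) <= len(self.C[s]):
                nxt = self.rng.choice(self.C[s])
            else:
                nxt = self.game.sample_crowd(s, self.rng)
                self.C[s].append(nxt)
            v = self.simulate(nxt)
        self.V[s] += v
        return v

    def best_action(self, s, n_sims=500):
        for _ in range(n_sims):
            self.simulate(s)
        return max(self.game.successors(s), key=lambda ch: self.N[ch])
```

Calling `Search(ToyGame()).best_action(ToyGame().root())` runs the simulations and returns the most visited successor of the root. The widening test keeps the number of stored crowd outcomes for a state close to the square root of its visit count, which is what bounds the fanout over continuous response times.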


5 Experiments 


In this section, we empirically evaluate our approach on three tasks. While the on-the-job setting we
propose is targeted at scenarios where there is no data to begin with, we use existing labeled datasets
(Table 1) to have a gold standard.


Baselines. We evaluated the following four methods on each dataset:

1. Human n-query: The majority vote of n human crowd workers was used as a prediction.

2. Online learning: Uses a classifier that trains on the gold output for all examples seen so
far and then returns the MLE as a prediction. This is the best possible offline system: it
sees perfect information about all the data seen so far, but cannot query the crowd while
making a prediction.

3. Threshold baseline: Uses the following heuristic: For each label, we ask for $m$ queries
such that $(1 - p_\theta(y_i \mid x)) \times 0.3^m > 0.98$. Instead of computing the expected marginals over
the responses to queries in flight, we simply count the in-flight requests for a given variable,
and reduce the uncertainty on that variable by a factor of 0.3. The system continues
launching requests until the threshold (adjusted by number of queries in flight) is crossed.
Predictions are made using MLE on the model given responses. The baseline does not
reason about time and makes all its queries at the very beginning.

4. LENSE: Our full system as described in Section 3.

| Dataset (Examples) | Task and notes | Features |
|---|---|---|
| NER (657) | We evaluate on the CoNLL-2003 NER task,^3 a sequence labeling problem over English sentences. We only consider the four tags corresponding to persons, locations, organizations or none.^4 | We used standard features [?]: the current word, current lemma, previous and next lemmas, lemmas in a window of size three to the left and right, word shape and word prefix and suffixes, as well as word embeddings. |
| Sentiment (1800) | We evaluate on a subset of the IMDB sentiment dataset [?] that consists of 2000 polar movie reviews; the goal is binary classification of documents into classes POS and NEG. | We used two feature sets, the first (UNIGRAMS) containing only word unigrams, and the second (RNN) that also contains sentence vector embeddings from [?]. |
| Face (1784) | We evaluate on a celebrity face classification task [?]. Each image must be labeled as one of the following four choices: Anderson Cooper, Daniel Craig, Scarlett Johansson or Miley Cyrus. | We used the last layer of an 11-layer AlexNet [?] trained on ImageNet as input feature embeddings, though we leave back-propagating into the net to future work. |

Table 1: Datasets used in this paper and number of examples we evaluate on.

| System | NER Delay/tok | NER Qs/tok | NER PER F1 | NER LOC F1 | NER ORG F1 | NER F1 | Face Latency | Face Qs/ex | Face Acc. |
|---|---|---|---|---|---|---|---|---|---|
| 1-vote | 467 ms | 1.0 | 90.2 | 78.8 | 71.5 | 80.2 | 1216 ms | 1.0 | 93.6 |
| 3-vote | 750 ms | 3.0 | 93.6 | 85.1 | 74.5 | 85.4 | 1782 ms | 3.0 | 99.1 |
| 5-vote | 1350 ms | 5.0 | 95.5 | 87.7 | 78.7 | 87.3 | 2103 ms | 5.0 | 99.8 |
| Online | n/a | n/a | 56.9 | 74.6 | 51.4 | 60.9 | n/a | n/a | 79.9 |
| Threshold | 414 ms | 0.61 | 95.2 | 89.8 | 79.8 | 88.3 | 1680 ms | 2.66 | 93.5 |
| LENSE | 267 ms | 0.45 | 95.2 | 89.7 | 81.7 | 88.8 | 1590 ms | 2.37 | 99.2 |

Table 2: Results on NER and Face tasks comparing latencies, queries per token (Qs/tok) and
performance metrics (F1 for NER and accuracy for Face).
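The threshold baseline's stopping rule can be sketched as a short loop. This sketch rests on one reading of the heuristic, which is an assumption: each query (received or in flight) is taken to shrink the residual uncertainty $1 - p_\theta(y_i \mid x)$ by a factor of 0.3, and querying stops once the implied confidence reaches 0.98. The function name and the `max_queries` cap are hypothetical.

```python
def queries_needed(p_top, factor=0.3, target=0.98, max_queries=10):
    """Number of queries until the implied confidence reaches the target.

    p_top is the model's probability for its most likely label.
    """
    m = 0
    uncertainty = 1.0 - p_top
    while 1.0 - uncertainty < target and m < max_queries:
        uncertainty *= factor  # each query is assumed to cut uncertainty to 30%
        m += 1
    return m


confident = queries_needed(0.99)
fairly_sure = queries_needed(0.9)
unsure = queries_needed(0.5)
```

Under this reading, a label the model already predicts with probability 0.99 needs no queries, one at 0.9 needs two (uncertainty 0.1, then 0.03, then 0.009), and one at 0.5 needs three. Because the count depends only on the current marginal, all queries can be launched at once, which matches the description of a baseline that does not reason about time.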


Implementation and crowdsourcing setup. We implemented the retainer model of [?] on Amazon
Mechanical Turk to create a “pool” of crowd workers that could respond to queries in real-time.
The workers were given a short tutorial on each task before joining the pool to minimize systematic
errors caused by misunderstanding the task. We paid workers $1.00 to join the retainer pool and
an additional $0.01 per query (for NER, since response times were much faster, we paid $0.005
per query). Worker response times were generally in the range of 0.5-2 seconds for NER, 10-15
seconds for Sentiment, and 1-4 seconds for Faces.

When running experiments, we found that the results varied based on the current worker quality. To
control for variance in worker quality across our evaluations of the different methods, we collected
5 worker responses and their delays on each label ahead of time.^5 During simulation we sample the
worker responses and delays without replacement from this frozen pool of worker responses.
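Sampling without replacement from a frozen pool can be sketched as below. The class name `FrozenPool`, the key format and the example responses are hypothetical and chosen only for illustration; the sketch is not taken from the LENSE code base.

```python
import random


class FrozenPool:
    """Pre-collected (response, delay) pairs, drawn without replacement."""

    def __init__(self, responses, seed=0):
        # responses maps a label key (e.g. a token position) to a list of
        # (response, delay_in_seconds) pairs collected ahead of time.
        self.rng = random.Random(seed)
        self.remaining = {key: list(items) for key, items in responses.items()}

    def draw(self, key):
        pool = self.remaining[key]
        if not pool:
            raise LookupError(f"no recorded responses left for {key!r}")
        # Remove the chosen pair so that it cannot be drawn again.
        return pool.pop(self.rng.randrange(len(pool)))


recorded = {
    "tok0": [("LOC", 0.5), ("LOC", 0.7), ("PER", 1.1), ("LOC", 0.9), ("NONE", 1.4)],
}
pool = FrozenPool(recorded, seed=1)
first = pool.draw("tok0")
```

Because every method under comparison draws from the same recorded answers, differences in results reflect the querying policy rather than which workers happened to be online. With 5 recorded responses per label, a policy can issue at most 5 queries on any one label in this setup.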


^3 http://www.cnts.ua.ac.be/conll2003/ner/

^4 The original also includes a fifth tag for miscellaneous; however, the definition for miscellaneous is complex,
making it very difficult for non-expert crowd workers to provide accurate labels.

^5 These datasets are available in the code repository for this paper.

Figure 3: Queries per example for LENSE on Sentiment. With simple UNIGRAM features, the
model quickly learns it does not have the capacity to answer confidently and must query the
crowd. With more complex RNN features, the model learns to be more confident and queries
the crowd less over time.

| System | Latency | Qs/ex | Acc. |
|---|---|---|---|
| 1-vote | 6.6 s | 1.00 | 89.2 |
| 3-vote | 10.9 s | 3.00 | 95.8 |
| 5-vote | 13.5 s | 5.00 | 98.7 |
| Online (UNIGRAMS) | n/a | n/a | 78.1 |
| Threshold (UNIGRAMS) | 10.9 s | 2.99 | 95.9 |
| LENSE (UNIGRAMS) | 11.7 s | 3.48 | 98.6 |
| Online (RNN) | n/a | n/a | 85.0 |
| Threshold (RNN) | 11.0 s | 2.85 | 96.0 |
| LENSE (RNN) | 11.0 s | 3.19 | 98.6 |

Table 3: Results on the Sentiment task comparing latency, queries per example and accuracy.

Figure 4: Comparing F1 and queries per token on the NER task over time. The left graph compares
LENSE to online learning (which cannot query humans at test time). This highlights that LENSE
maintains high F1 scores even with very small training set sizes, by falling back to the crowd when it
is unsure. The right graph compares query rate over time to 1-vote. This clearly shows that as the
model learns, it needs to query the crowd less.

Summary of results. Table 2 and Table 3 summarize the performance of the methods on the three
tasks. On all three datasets, we found that on-the-job learning outperforms machine-only and human-only
comparisons on both quality and cost. On NER, we achieve an F1 of 88.4% at more than an order of
magnitude lower cost than achieving a result of comparable quality using the 5-vote approach.
On Sentiment and Faces, we reduce costs for a comparable accuracy by a factor of around 2. For the
latter two tasks, both on-the-job learning methods perform less well than in NER. We suspect this
is due to the presence of a dominant class (“none”) in NER that the model can very quickly learn to
expend almost no effort on. LENSE outperforms the threshold baseline, supporting the importance
of Bayesian decision theory.


Figure 4 tracks the performance and cost of LENSE over time on the NER task. LENSE is not only
able to consistently outperform other baselines, but the cost of the system steadily reduces over time.
On the NER task, we find that LENSE is able to trade off time to produce more accurate results than
the 1-vote baseline with fewer queries by waiting for responses before making another query.


While on-the-job learning allows us to deploy quickly and ensure good results, we would like to
eventually operate without crowd supervision. In Figure 3, we show the number of queries per example
on Sentiment with two different feature sets, UNIGRAMS and RNN (as described in Table 1). With
simpler features (UNIGRAMS), the model saturates early and we will continue to need to query
the crowd to achieve our accuracy target (as specified by the loss function). On the other hand,
using richer features (RNN), the model is able to learn from the crowd and the amount of supervision
needed reduces over time. Note that even when the model capacity is limited, LENSE is able to
guarantee a consistent, high level of performance.
6 Related Work


On-the-job learning draws ideas from many areas: online learning, active learning, active
classification, crowdsourcing, and structured prediction.

Online learning. The fundamental premise of online learning is that algorithms should improve
with time, and there is a rich body of work in this area [?]. In our setting, algorithms not only
improve over time, but maintain high accuracy from the beginning, whereas regret bounds only
achieve this asymptotically.

Active learning. Active learning (see [19] for a survey) algorithms strategically select the most
informative examples to build a classifier. Online active learning [8, 9, 10] performs active learning
in the online setting. Several authors have also considered using crowd workers as a noisy oracle
[?]. This line of work differs from our setup in that it assumes that labels can only be observed after
classification, which makes it nearly impossible to maintain high accuracy in the beginning.

Active classification. Active classification [24, 25, ?] asks what are the most informative features
to measure at test time. Existing active classification algorithms rely on having a fully labeled
dataset which is used to learn a static policy for when certain features should be queried, which does
not change at test time. On-the-job learning differs from active classification in two respects: true
labels are never observed, and our system improves itself at test time by learning a stronger model.
A notable exception is Legion:AR [?], which like us operates in the on-the-job learning setting, for
real-time activity classification. However, they do not explore the machine learning foundations
associated with operating in this setting, which is the aim of this paper.

Crowdsourcing. A burgeoning subset of the crowdsourcing community overlaps with machine
learning. One example is Flock [?], which first crowdsources the identification of features for an
image classification task, and then asks the crowd to annotate these features so it can learn a decision
tree. In another line of work, TurKontrol [?] models individual crowd worker reliability to optimize
the number of human votes needed to achieve confident consensus using a POMDP.

Structured prediction. An important aspect of our prediction tasks is that the output is structured,
which leads to a much richer setting for on-the-job learning. Since tags are correlated, the
importance of a coherent framework for optimizing querying resources is increased. Making active partial
observations on structures has been explored in the measurements framework of [?] and in the
distant supervision setting [30].

7 Conclusion 

We have introduced a new framework that learns from (noisy) crowds on-the-job to maintain high
accuracy while reducing cost significantly over time. The technical core of our approach is modeling
the on-the-job setting as a stochastic game and using ideas from game playing to approximate the
optimal policy. We have built a system, LENSE, which obtains significant cost reductions over a
pure crowd approach and significant accuracy improvements over a pure ML approach.